Skip to content

FEAT: Add ActiveDirectoryMSI support for bulk copy#573

Open
bewithgaurav wants to merge 11 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi
Open

FEAT: Add ActiveDirectoryMSI support for bulk copy#573
bewithgaurav wants to merge 11 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi

Conversation

@bewithgaurav
Copy link
Copy Markdown
Collaborator

@bewithgaurav bewithgaurav commented May 12, 2026

Work Item / Issue Reference

AB#

GitHub Issue: #534


Summary

Adds Authentication=ActiveDirectoryMSI support to bulk copy. Partial fix for #534.

  • Zero-arg ManagedIdentityCredential() for system-assigned MSI.
  • ManagedIdentityCredential(client_id=UID) for user-assigned MSI. Matches ODBC convention where UID carries the identity's client_id under MSI.
  • Threads optional credential_kwargs through get_auth_token / get_raw_token / _acquire_token so future auth methods that need constructor args plug in via the same channel.
  • Credential cache key stays a plain string for zero-arg auth types and becomes a tuple (auth_type, sorted_kwargs) when kwargs are present, so different client_ids get separate cached credentials.

mssql-py-core (the Rust path used by bulk copy) doesn't itself acquire Entra tokens. Python pre-acquires a JWT and passes it as access_token. Today this works for Default, DeviceCode, and Interactive. MSI was missing, the most common Azure-hosted-service auth method.

ServicePrincipal and Password are explicitly out of scope for this PR. They need different design work and ship separately.

How user-assigned MSI survives the bulkcopy fresh-token path

Connection.__init__ calls process_connection_string, which sanitizes self.connection_str (strips UID=, Authentication=, PWD=). Bulkcopy needs a fresh token at copy time. Re-parsing self.connection_str at that point would lose UID and silently degrade user-assigned MSI to system-assigned (wrong-identity bug).

Fix shape: process_connection_string returns a 4-tuple including the captured credential_kwargs. Connection.__init__ persists _credential_kwargs alongside _auth_type. cursor.bulkcopy() reads self.connection._credential_kwargs directly. No re-parse, no chance for the sanitized string to drop the client_id.

Connection string examples

# System-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;Database=mydb;"

# User-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;UID=<client_id_guid>;Database=mydb;"

Adds Authentication=ActiveDirectoryMSI to the auth pipeline:

- Zero-arg ManagedIdentityCredential() for system-assigned MSI.
- ManagedIdentityCredential(client_id=UID) for user-assigned MSI,
  matching ODBC's convention where UID carries the identity's
  client_id under MSI.
- Threads optional credential_kwargs through get_auth_token /
  get_raw_token / _acquire_token so future auth methods that need
  constructor args (e.g. ClientSecretCredential) can plug in via
  the same channel.
- Cache key remains a plain string for zero-arg auth types and
  becomes a tuple when kwargs are present, so different client_ids
  get separate cached credentials.

Partial fix for #534. ServicePrincipal and
Password to follow as separate PRs.
Copilot AI review requested due to automatic review settings May 12, 2026 12:09
@bewithgaurav bewithgaurav changed the title Add ActiveDirectoryMSI support for bulk copy FIX: Add ActiveDirectoryMSI support for bulk copy May 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for Authentication=ActiveDirectoryMSI (managed identity) token acquisition for the bulk copy (mssql-py-core) path, including user-assigned MSI via UID=<client_id>.

Changes:

  • Add AuthType.MSI (activedirectorymsi) and map it to ManagedIdentityCredential.
  • Thread optional credential_kwargs through token acquisition and update the credential-instance cache key to incorporate kwargs.
  • Attempt to extract MSI client_id from the connection string and pass it into bulk copy token acquisition; add tests and a changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
mssql_python/constants.py Adds AuthType.MSI enum value.
mssql_python/auth.py Adds MSI credential support, credential kwargs plumbing, and cache key changes; adds extract_credential_kwargs.
mssql_python/cursor.py Bulk copy now tries to extract MSI credential kwargs and pass them to get_raw_token.
tests/test_008_auth.py Adds tests for MSI auth type parsing, credential construction, cache isolation, and connection-string processing.
CHANGELOG.md Documents MSI support for bulk copy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/cursor.py Outdated
Comment thread mssql_python/auth.py Outdated
Comment thread tests/test_008_auth.py
@bewithgaurav bewithgaurav changed the title FIX: Add ActiveDirectoryMSI support for bulk copy FEAT: Add ActiveDirectoryMSI support for bulk copy May 13, 2026
@bewithgaurav bewithgaurav marked this pull request as draft May 13, 2026 12:15
Connection.__init__ overwrites self.connection_str with the sanitized
(UID-stripped) string returned by process_connection_string. The original
implementation re-parsed self.connection_str at bulkcopy time via
extract_credential_kwargs, which silently dropped the user-assigned MSI
client_id and degraded to system-assigned  a wrong-identity bug.MSI

Changes:

- process_connection_string now returns a 4-tuple including the captured
  credential_kwargs so callers can persist them.
- Connection.__init__ stores _credential_kwargs alongside _auth_type.
- cursor.bulkcopy() reads self.connection._credential_kwargs instead of
  re-parsing self.connection_str.
- The public extract_credential_kwargs helper is removed (it only existed
  to support the broken re-parse path; nothing else needs it).
- black --line-length=100 reformats (CI was red).

Tests:

- test_bulkcopy_path_preserves_user_assigned_msi_client_id: invokes
  cursor.bulkcopy() with a mocked mssql_py_core, patches
  AADAuth.get_raw_token to capture the args it receives, and asserts
  the captured credential_kwargs match Connection._credential_kwargs.
  Fails if cursor reverts to re-parsing self.connection.connection_str.
- test_credential_kwargs_persisted_for_user_assigned_msi: asserts
  Connection.__init__ stores _credential_kwargs from the 4-tuple.
- test_credential_kwargs_none_for_system_assigned_msi.
- test_credential_kwargs_none_for_non_msi_auth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bewithgaurav bewithgaurav marked this pull request as ready for review May 14, 2026 12:31
@github-actions github-actions Bot added the pr-size: medium Moderate update size label May 14, 2026
Comment thread mssql_python/auth.py Outdated
identity's client_id. Returns None for system-assigned MSI.
"""
for param in parameters:
key, _, value = param.strip().partition("=")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get this from the conn str parser map? Assuming we need the conn str UID

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what will happen if someone provides a UID={hello=world}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - missed this, parser should be used as a standard across
that handles {hello=world} correctly

bewithgaurav and others added 3 commits May 14, 2026 18:40
Per @saurabh500's review: braced ODBC values like UID={hello=world} need
the canonical parser, not naive partition('='). Without this, the helper
returns '{hello=world}' verbatim and ManagedIdentityCredential rejects it.
Worse, a UID containing a literal ';' would be truncated.

_extract_msi_client_id now delegates to _ConnectionStringParser, which
handles braces, escaped '}}' inside braces, and '=' inside braced values
correctly. validate_keywords=False so the helper never raises on keys
the auth flow doesn't care about.

Tests:
- test_msi_braced_uid_value_is_unwrapped: UID={hello=world} -> 'hello=world'
- test_msi_braced_uid_with_semicolon_is_preserved: UID={abc;def;ghi}

Note: process_connection_string and extract_auth_type still use naive
split(';') for Authentication= detection across all Entra ID auth types.
That's pre-existing and tracked separately for a parser-wide refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mssql_python/auth.py
Comment thread mssql_python/auth.py
- Debug log distinguishing user-assigned vs system-assigned MSI when the
  user passes Authentication=ActiveDirectoryMSI. Helps diagnose which
  branch was taken when token acquisition fails. Logs client_id length,
  not value (still identity material).
- Comment above _credential_cache explains the cache key shape so the
  unbounded growth is understood as a deliberate choice rather than an
  oversight.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

📊 Code Coverage Report

🔥 Diff Coverage

100%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 7039 out of 27155
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/auth.py (100%)
  • mssql_python/connection.py (100%)
  • mssql_python/constants.py (100%)

Summary

  • Total: 34 lines
  • Missing: 0 lines
  • Coverage: 100%

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Connection.__init__ already parses the same connection string through
_ConnectionStringParser via _construct_connection_string (connection.py
line 253) before process_connection_string is ever called. By the time
_extract_msi_client_id runs, the input is guaranteed parseable.

The try/except was dead code. A real parse failure here would indicate
an upstream bug and should propagate, not silently degrade user-assigned
MSI to system-assigned (which is the wrong-identity failure mode this PR
exists to prevent).

Brings diff coverage to 100%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
saurabh500
saurabh500 previously approved these changes May 14, 2026
Comment thread mssql_python/auth.py
# time we get here the input is guaranteed parseable. No defensive
# try/except: a parse failure now means a real bug upstream and should
# propagate, not silently degrade user-assigned MSI to system-assigned.
parsed = _ConnectionStringParser(validate_keywords=False)._parse(connection_string)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally correct but a tiny perf hit due to double parsing of connections string.
Perfect solution would require parsing done earlier in the connect api and passing around the map to this function. That's a bigger change. I recommend a follow up PR and tracking with a GH issue.

An adhoc fix is to maintain a hashmap of connection string and uid. But that's prone to other problems esp concurrency.

bewithgaurav added a commit that referenced this pull request May 15, 2026
…tructure

Brings in the ActiveDirectoryMSI feature (PR #573, mostly approved) so SP
can use the shared module-level credential cache that PR #573 introduces,
instead of the closure-scoped cache the SP commit shipped with.

Conflicts resolved (kept both contributions):
- mssql_python/constants.py: AuthType.MSI + AuthType.SERVICE_PRINCIPAL
- mssql_python/auth.py: _parse_tenant_id + ServicePrincipalAuth +
  _extract_msi_client_id; both elif branches in process_auth_parameters;
  both keys in extract_auth_type's auth_map
- mssql_python/cursor.py: SP factory branch on _auth_type=='serviceprincipal',
  Model A branch threads credential_kwargs through AADAuth.get_raw_token
  (MSI's contract change)
- tests/test_008_auth.py: MockClientSecretCredential + MockManagedIdentityCredential
  in setup_azure_identity, both sets of TestProcessAuthParameters /
  TestExtractAuthType cases

ServicePrincipalAuth.make_token_factory now uses the shared
_credential_cache + _credential_cache_key (introduced for MSI), keyed by
('serviceprincipal', tenant_id, client_id). Closure-scoped cache and the
TODO(#573-followup) note removed. client_secret is intentionally NOT in
the cache key because credentials are looked up by identity, not secret;
a rotated secret will surface as ClientAuthenticationError, not a silent
hit on the wrong credential.

Tests: 86 passed in tests/test_008_auth.py (was 75 SP-only). Black clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# On Windows Interactive, process_connection_string returns None
# (DDBC handles auth natively), so fall back to the connection string.
self._auth_type = connection_result[2] or extract_auth_type(self.connection_str)
self._credential_kwargs = connection_result[3]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_connection_string changed from 3-tuple to 4-tuple return. This is internal so it's fine, but worth noting in the docstring that callers must unpack all 4. If a third-party or test is doing a, b, c = process_connection_string(...), they'll get a ValueError with no obvious cause. The existing tests were all updated - just flagging for awareness.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call. the 4-tuple is interim. I have filed #580 to refactor process_connection_string to take the parsed connection-string map instead of the raw string (also avoids the 3x parse on the connect path). once that lands, the 4-tuple goes away and callers move to dict access.

for now, added a short docstring note about the 4-tuple to cover anyone reaching in before #580 lands.

Tracking refactor (parse-once, thread the parsed map through the auth
path) is a separate follow-up; this docstring helps anyone reaching
into the function before that lands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants